1 Introduction

DNABarcodeCompatibility works in 2 steps:

2 Illumina barcode set from TruSeq small RNA kits

2.1 Finding compatible combinations of barcodes: simulations relative to Step1 of the workflow

To find compatible combinations of DNA barcodes, one approach is to generate all possible combinations of DNA barcodes and perform an exhaustive search of compatible combinations. However, the total number of combinations may become prohibitively large for large barcode sets and large multiplexing numbers. Thus, an exhaustive search of compatible combinations among all possible combinations may not be possible and in fact is not usually useful for such large datasets. An alternative strategy is to choose a fixed number of randomly selected combinations for finding compatible combinations. To determine the conditions for which a random search is preferable over an exhaustive search, we consider the execution time of the program as a critical parameter.

2.2 Analyzing barcode-set properties: set parameters for the simulations

The execution time of the program depends on the size of the barcode dataset, the multiplex level and the Illumina chemistry used. We therefore test different values of these parameters to vizualise how they influence the execution time.

Table 2.2: parameters used in the simulations

Number of simulations per condition Barcode-set size Multiplex level Chemistry
10 12, 18, 24, 30, 36, 42, 48 2, 3, 4, 5 1, 2, 4
Number of simulations
840

2.2.1 Running simulations for exhaustive searches

2.2.2 Execution time of an exhaustive search as a function of the total number of combinations

The execution time of an exhaustive search increases linearly with the total number \({n \choose k}\) of combinations. The constant of proportionality corresponds to the average computation time required to check compatibility of a given combination. This time is slighty faster for the 4-channel chemistry. For the ease of use, we fix to a few seconds the time of execution. On a standard computer, this corresponds to an exhaustive search among 2000 combinations for the 4-channel chemistry. Above 2000 combinations, the program will perform a search of compatible barcodes from a randomly generated set of barcode combinations. Other important aspects also motivate this choice (see below).

2.2.3 Execution time of an exhaustive search as a function of the barcode-set size

The total number of combinations (compatible or not) is given by the binomial coefficient \({n \choose k}\) where \(n\) is the number of barcodes in the dataset, referred to as the barcode-set size, and \(k\) is the mutliplex level (\(n\) and \(k\) being positive integers). We now plot the execution time as function of the barcode-set size for various multiplex levels. This reveals that the execution time is less than 2 seconds for a multiplex level of 2 with the 48-barcode Illumina small RNA TruSep kits. Therefore, in these conditions, it is advantageous to perform an exhaustive search of compatible combinations for a multiplex level of 2. For a multiplex level of 4 or higher, the program will typically perform a search of compatible barcodes from a randomly generated set of barcode combinations.

2.2.4 Probability to find compatible barcodes in the total ensemble of combinations

  • The fraction of compatible combinations among the total set of \({n \choose k}\) possible combinations show high variability for small randomly-selected barcode sets

Note that the probability that a randomly chosen combination of \(k\) barcodes among \(n\), which are themselves chosen randomly among the 48 barcodes of the Illumina set, is independent of \(n\), consistent with the observations. This probability is moreover fairly independent on the chemistry for this particular barcode set.

  • The probability to find compatible combinations increases with the multiplex level

  • Relationship between the number of compatibles combinations and the total number of combinations

For a multiplex level of 2, the number of compatible combinations is less than 100 for the combined Illumina TruSeq small RNA kits. Therefore, for a multiplex level of 2, an exhaustive search is preferable to retrieve all possible compatible combinations of barcodes for the optimization steps. We fix to 2024 the total number of combinations beyond which a (much faster) random search of compatible combinations is performed (see dashed line in the graph below). This number is motivated by two reasons. Below this threshold, an exhaustive search will be systematically performed for a multiplex level of 2 (less than 2024 combinations in total). Above this threshold, a random search will provide enough compatible combinations for futher optimization steps. Indeed, the above results indicate that the probability to get a compatible combination is at least 18% for multiplex levels >= 3. Briefly, the random search will randomly pick up an arbitrarily fixed number of 1000 combinations representing a pool of at least 180 compatible combinations, which is enough for the further optimization steps to be efficient (see below). Thus, the DNABarcodeCompatibility::experiment_design() function of the package will use a threshold of 2024 combinations in total to switch between an exhaustive search and a random search of compatible combinations.

2.2.5 Compatible barcode combinations: average barcode frequencies among input barcode sets

The frequencies of the various barcodes appearing in the compatible combinations generated above may be highly heterogeneous for small multiplex levels. We plot here the distribution of barcodes averaged over the different barcode sets, for distinct chemistries and multiplex levels. Despite the averaging of the distibutions over different barcode sets of different sizes, the heterogeneity between barcode frequencies is clearly apparent.

  • 4-channel chemistry

  • 2-channel chemistry

  • 1-channel chemistry

2.2.6 Compatible barcode combinations: single examples

Individual sets of 12 barcodes were randomly generated from the four Illumina TruSeq small RNA kits comprising 48 barcodes in total. We see that barcode frequencies can be very heterogeneous even for multiplex levels of 4 or 5. Therefore, without further optimization, consumable usage may be strongly biased among preparation kits.

  • 4-channel chemistry, multiplex level 4, barcode-set size 12

  • 4-channel chemistry, multiplex level 5, barcode-set size 12

  • 2-channel chemistry, multiplex level 4, barcode-set size 12

2.3 Optimizing combinations: set parameters for simulations relative to Step2 of the workflow

  • Step2: optimizing the set of compatible barcodes to minimize the heterogeneity of barcode frequencies

The goal here is to estimate the probability to get an optimal solution, in the sense of a maximum entropy solution, and to test how the optimizer improves the solution compared with a random pick.

We compare the greedy search algorithms implemented in the DNABarcodeCompatibility R-package: the greedy descent and the greedy exchange algorithms (described in supplementary file S2). We use different subsets of randomly-generated compatible combinations, and for each of those we perform 100 simulations with both algorithms, in order to estimate the probability \(P_{1}\) to get an optimal solution. This optimal solution consists of a selection of compatible barcode combinations in which the barcode distribution is close to uniform. The number of barcode appearing in the optimal solution is the product of the multiplex level by the number of lanes, which defines the number of libraries to be sequenced.

Based on this probability \(P_{1}\), we devise an optimizer function that performs \(R\) trials of the greedy search. The probability to get an optimal solution after \(R\) iterations reads \(P_{R} = 1-(1-P_{1})^R\). For example, if we want to find an optimal solution at least 95% of cases \((P_{R} = 0.95)\), we need a number of iterations \(R \geq log(1-P_{R})/log(1-P_{1})\).

  • We therefore run two series of simulations:

Step2a: find the probability \(P_{1}\) to get an optimal combination given the subset size, and estimate \(R\) such that \(P_{R} \geq 0.95\), in order to refine the optimization in a reasonable computation time.

Step2b: check the efficacy of the optimizer function on a few challenging examples

2.3.1 Step2a: estimating \(P_{1}\)

Table 2.3.1: parameters used for the simulations of step2a

Number of simulations per condition Barcode-set size Multiplex level Chemistry Lane number Algorithm Compatible combinations (GD) Compatible combinations (GE)
100 12, 18, 24, 30 2, 3, 4, 5 2, 4 4, 6, 8 greedy_exchange, greedy_descent 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 400
Number of simulations
3.16810^{5}

2.3.1.1 Running simulations for step2a (without iteration)

2.3.1.2 Execution time of a single greedy search run for different sizes of compatible combination subsets

We perform 100 repetitions of the greedy search for the conditions (described in table 2.3.1 above). When the set of compatible combinations is large, the computation time, required to run these greedy algorithms on the full set of compatible combinations generated in step1, may become prohibitively large. To reduce the computation time, we apply the greedy search on a randomly generated subset of 10 \(\leq N_{comp}\leq\) 400 compatible combinations, for a barcode set size comprized between 12 and 30, and a multiplex level between 2 and 5. Here, we want to know how the size of the subset influences the computation time for each of the greedy search algorithm.

We averaged the execution time over 100 replicates of different combinations of conditions of chemistry, barcode-set size, number of lanes, and multiplex level. Part of the variability observed for the execution time comes from the diversity of conditions that may substantially influence the execution time.

The graph below indicates that the running time of the greedy-descent algorithm is quadratic as a function of the barcode-set size, which is expected, since the number of operations required to complete this algorithm on a set of \(N_{comp}\) compatible combinations is \(\mathcal{O}(N_{comp}^2)\). In contrast, the computation time of the greedy-exchange algorithm is linear (\(\mathcal{O}(N_{comp})\)).

2.3.1.5 Estimation of the execution time required for R iterations of a greedy search, with R chosen to ensure 95% confidence of obtaining a maximum entropy barcode selection

The probability of getting an optimal solution after \(R\) iterations reads \(P_{R} = 1-(1-P_{1})^R\). For example, if we want to find an optimal solution at least 95% of cases \((P_{R} = 0.95)\), we need a number of iterations \(R \geq log(1-P_{R})/log(1-P_{1})\).

  • Maximum execution time (y-axis) and number of iterations R (color coded) plotted as a function of the number of compatible barcode combinations used for the search, grouped by chemistry, algorithm, barcode-set size, and multiplex level

The results shown in the graphs below are heterogeneous between conditions. However, common trends emerge. First, in general, the greedy-exchange search performs the optimization faster over R iterations than the greedy-descent search, except in a few cases (see next point) where the number \(N_{comp}\) of compatible combinations is smaller than 30. Second, when \(N_{comp}\) is larger than 120 (only tested for the greedy-exchange algorithm), the optimization requires less iterations in general and performs faster. Based on these observations, it seems reasonable to fix the compatible-combination subset size to 120.

  • Number of iterations required to find the optimal selection with 95 % confidence, and corresponding ratio between greedy-descent and greedy-exchange execution times, as a function of the number of lanes \(a\), grouped by algorithm, multiplex level, and number of compatible barcodes used for the search

In the graphs below, data are now pooled regardless chemistry and barcode-set sizes. The number displayed in the right grey panel represents the size the randomly chosen subset of compatible combinations. We confirm that the greedy-descent search performs faster for small subsets of compatible combinations. For subsets larger than 30 compatible combinations, it seems that 50 iterations would be a reasonable number for an efficient and fast optimization by the greedy-exchange search.

2.3.1.6 Distributions of entropy values without (S_random) and with (S_opt) optimization

To visualize the efficacy of the greedy search algorithms, we compare the distribution of entropy values obtained from randomly generated selections of compatible barcodes with the distribution of entropy values obtained from optimized selections of compatible barcodes. The graphs below show that both greedy algorithms are comparable and dramatically reduce the barcode frequencies’ heterogeneity by favoring selections of compatible barcode combinations with a higher entropy. One graph table is shown per subset size. The blue line indicates the position of the theoretical value of the maximum entropy \(S_{max}\). A dashed line indicates that \(S_{max}\) has been reached, whereas a solid line indicates that \(S_{max}\) has not been reached.

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

2.3.2 Step2b: table of parameters used for the simulations

The results of the simulations in step2a suggest a few cases for which the optimization procedure would require more than 100 iterations to reach \(S_{max}\). It is instructive to use these challenging conditions to test how good our greedy-exchange algorithm would perform compared to a basic optimization by iterative random picks. In the table below are summarized a few challenging conditions to be tested.

Number of simulations per condition Barcode-set size Multiplex level Chemistry Lane number Algorithm Compatible combinations (GD) Compatible combinations (GE) Max iterations R
100 12, 30 4, 5 4 6 greedy_exchange 120 200
Number of simulations
400

2.3.2.1 Running simulations for step2b with iterations

2.3.2.2 Distributions of entropy values after running the optimizer

We compare the distributions of entropy values obtained using 3 different strategies over a maximum of 200 iterations per trial. We performed 100 trials for each strategy. We calculate and compare the 3 resulting entropy values:

  1. S_random: no optimization; a subset of compatible combinations is randomly picked up only once (no iteration)
  2. S_opt: a subset of compatible combinations is optimized using the greedy-exchange algorithm, stopping when a selection of entropy \(S_{max}\) is obtained, or when the iteration number reaches 200.
  3. S_random_opt: a subset of compatible combinations is optimized by random pick using the same amount of iterations used with the greedy search.
  • Set of 120 compatible combinations and up to 200 iterations per trial

In the graphs below, the blue line indicates the position of the theoretical value of the maximum attainable entropy \(S_{max}\). Solid lines indicate that \(S_{max}\) has not been reached over 200 iterations. Either the probability to find \(S_{max}\) is too low to be reached in 200 iterations, or \(S_{max}\) cannot be reached at all due to constraints in the barcode set. Furthermore, the dashed line indicates that \(S_{max}\) has been reached.

In all cases, the greedy search algorithm performs better than a random search, and reaches a value close to the best entropy value.

## [[1]]

  • Average execution time per condition
Conditions Set of 12 barcodes Set of 30 barcodes
mplex-4 245 sec 253 sec
mplex-5 1 sec 291 sec

2.3.2.3 Comparison of barcode distributions between a randomly generated set and an optimized selection of compatible combinations

As above \(n\), \(k\) and \(a\) denote the barcode set size, the multiplex level, and the number of lanes, respectively

2.3.2.3.1 Case A: \(n = 30\), \(k = 4\), \(a = 6\), \(S_{max}\) is reached
  • Barcode set obtained:
Barcode set
RPI03, RPI04, RPI06, RPI07, RPI08, RPI09, RPI11, RPI12, RPI14, RPI15, RPI16, RPI17, RPI18, RPI21, RPI22, RPI23, RPI27, RPI31, RPI32, RPI33, RPI34, RPI35, RPI36, RPI37, RPI38, RPI40, RPI43, RPI46, RPI47, RPI48
  • Selection of compatible combinations that was randomly chosen

  • Set of compatible barcodes that was optimized with the iterative greedy-exchange search

We use the same experimental conditions as above for a fair comparison. Here, the number of libraries \(N=a.k=24\) is smaller than the barcode set size (\(n=30\)). The optimal selection has a uniform distribution of the barcodes used (of entropy \(S_{max} = log(N)\)), and is reached by the greedy algorithms.

2.3.2.3.2 Case B: \(n = 30\), \(k = 5\), \(a = 6\), \(S_{max}\) is not reached
  • Barcode set obtained:
Barcode set
RPI03, RPI04, RPI06, RPI07, RPI08, RPI09, RPI11, RPI12, RPI14, RPI15, RPI16, RPI17, RPI18, RPI21, RPI22, RPI23, RPI27, RPI31, RPI32, RPI33, RPI34, RPI35, RPI36, RPI37, RPI38, RPI40, RPI43, RPI46, RPI47, RPI48
  • Selection of compatible combinations that was randomly chosen

  • Set of compatible barcodes that was optimized with the iterative greedy-exchange search

We use the same experimental conditions as above for a fair comparison. Here, the number of libraries \(N = a.k = 30\) is equal to the barcode set size \(n\), and the maximum possible value of the entropy of the barcode selection is the entropy of the uniform distribution, \(S_{max}=log(n)\). Such a selection, exhausting the set of available barcodes, would correspond to a partition of this set in \(a=6\) disjoint combinations of \(k=5\) barcodes. However, no such partition exists for the set of compatible combinations at hand, which allows one to form selections containing a maximum of 29 barcodes. Hence, a uniform barcode distribution cannot be attained. Nevertheless, the greedy algorithm still finds a selection having the maximum attainable entropy, \(S_{opt} = 3.35\), corresponding to the value of \(S_{max}\) obtained (Eq. 1 in Supplementary File S2) for \(N = 30\) and \(n = 29\).

2.3.2.3.3 Case C: \(n = 12\), \(k = 5\), \(a = 6\), \(S_{max}\) is reached
  • Barcode set obtained:
Barcode set
RPI01, RPI06, RPI07, RPI10, RPI15, RPI16, RPI28, RPI31, RPI41, RPI42, RPI46, RPI47
  • Selection of compatible combinations that was randomly chosen

  • Set of compatible barcodes that was optimized with the iterative greedy-exchange search

We use the same experimental conditions as above for a fair comparison. Here, the number of libraries (\(N=a.k=30\)) is greater than the barcode set size (\(n=12\)). A selection with uniform barcode distribution cannot be reached, however, because \(N\) is not divisible by \(n\). However, the selection having the maximum possible entropy \(S_{max}\) is reached.

2.3.2.3.4 Case D: \(n=12\), \(k=4\), \(a=6\), \(S_{max}\) is not reached
  • Barcode set obtained:
Barcode set
RPI01, RPI06, RPI07, RPI10, RPI15, RPI16, RPI28, RPI31, RPI41, RPI42, RPI46, RPI47
  • Selection of compatible combinations that was randomly chosen

  • Set of compatible barcodes that was optimized with the iterative greedy-exchange search

We use the same experimental conditions as above for a fair comparison.

3 Randomly-generated barcode set from the DNABarcodes package

3.1 Generating a barcode set with the DNABarcodes package from Tilo Buschmann

We use the DNABarcodes package from Tilo Buschmann to generate a new set of 6-mer barcodes, which are robust against at least one substitution error (Hamming distance of 3). In addition, barcodes including stretches of 4 or more nucleotides (homopolymer of length >3) are discarded as well as those that have an unbalanced ratio of bases G or C versus A or T. We then exclude barcodes that are present both in the Illumina TruSeq small RNA kits and in the newly generated set.

library("DNABarcodes")
## Loading required package: Matrix
## Loading required package: parallel
newBarcodesRaw <- create.dnabarcodes(n=6, dist=3, metric = "hamming", filter.triplets = F, filter.gc = T, filter.self_complementary = F)
## 1) Creating pool ...  of size 1280
## 2) Conway closing...  done
newBarcodes <- newBarcodesRaw[!(newBarcodesRaw %in% DNABarcodeCompatibility::IlluminaIndexes$sequence)]

newBarcodeSet <- data.frame(Id=paste0("B",sprintf("%02d",1:length(newBarcodes))), sequence=newBarcodes, GC_content=50, homopolymer=FALSE, stringsAsFactors = F)[1:48,]

newBarcodeSet[,1:2]
##     Id sequence
## 1  B01   GGGAAA
## 2  B02   CCCAAA
## 3  B03   CGAGAA
## 4  B04   ACGGAA
## 5  B05   GACGAA
## 6  B06   GCACAA
## 7  B07   CAGCAA
## 8  B08   AGCCAA
## 9  B09   TCGAGA
## 10 B10   GTCAGA
## 11 B11   CGTAGA
## 12 B12   TGACGA
## 13 B13   ATGCGA
## 14 B14   GATCGA
## 15 B15   CCATGA
## 16 B16   CTGACA
## 17 B17   TGCACA
## 18 B18   GCTACA
## 19 B19   TCAGCA
## 20 B20   ATCGCA
## 21 B21   CATGCA
## 22 B22   GGATCA
## 23 B23   TGGGTA
## 24 B24   TCCCTA
## 25 B25   GCGTTA
## 26 B26   CGCTTA
## 27 B27   TAGGAG
## 28 B28   AGTGAG
## 29 B29   TTCCAG
## 30 B30   CTGTAG
## 31 B31   ACCTAG
## 32 B32   GAAAGG
## 33 B33   ACTAGG
## 34 B34   TTTGGG
## 35 B35   AGATGG
## 36 B36   TACTGG
## 37 B37   AAGACG
## 38 B38   TAACCG
## 39 B39   ATTCCG
## 40 B40   GATTCG
## 41 B41   CGAATG
## 42 B42   GTGATG
## 43 B43   AACCTG
## 44 B44   TGTCTG
## 45 B45   CCTTTG
## 46 B46   TCTGAC
## 47 B47   CTACAC
## 48 B48   GAGTAC

3.2 Analyzing barcode-set properties: set parameters for the simulations

Number of simulations per condition Barcode-set size Multiplex level Chemistry
10 12, 18, 24, 30, 36, 42, 48 2, 3, 4, 5 1, 2, 4
Number of simulations
840

3.2.1 Running simulations for exhaustive searches

3.2.2 Probability to find compatible barcodes in the total ensemble of combinations

  • The probability to find compatible combinations is more variable for small randomly-selected barcode sets

  • The probability to find compatible combinations increases with the multiplex level

  • Relationship between the number of compatibles combinations and the total number of combinations

For a multiplex level of 2, the number of compatible combinations is even smaller than for the combined Illumina TruSeq small RNA kits. Therefore, for a multiplex level of 2, an exhaustive search is preferable to retrieve all possible compatible combinations of barcodes for the optimization steps. The dashed line in the graph below indicates the threshold (2024) over which the DNABarcodeCompatibility::experiment_design() function of the package will perform a random search of compatible combinations instead of an exhaustive search.

3.2.3 Compatible barcode combinations: average barcode frequencies among input barcode sets

The frequencies of the various barcodes appearing in the compatible combinations generated above may be highly heterogeneous for small multiplex levels. We plot here the distribution of barcodes averaged over the different barcode sets, for distinct chemistries and multiplex levels. Despite the averaging of the distibutions over different barcode sets of different sizes, the heterogeneity between barcode frequencies is clearly apparent.

  • 4-channel chemistry

  • 2-channel chemistry

  • 1-channel chemistry

3.3 Optimizing combinations: set parameters for the simulations

Number of simulations per condition Barcode-set size Multiplex level Chemistry Lane number Algorithm Compatible combinations (GD) Compatible combinations (GE)
100 12, 18, 24, 30 3, 4, 5 2, 4 4, 6, 8 greedy_exchange 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300
Number of simulations
1.0810^{5}

3.3.1 Running simulations for the optimization without iteration

3.3.2 Probability to get the best optimized set of barcode combinations for a single greedy search

In some cases, the probability to reach the theoretical maximum entropy \(S_{max}\) might be \(\leq 10^{-2}\) or even be zero. Here, to take account of such cases, we estimate the probability to get the highest value of the entropy \(S_{opt}\) where possibly \(S_{opt} < S_{max}\). In these conditions, the barcode distribution won’t necessarily be uniform, but still much less heterogeneous than it would be for a selection obtained by a random pick (see below).

3.3.3 Estimation of the maximum execution time (R iterations) of a greedy search to obtain the max entropy in 95% of cases

The probability of getting an optimal solution after \(R\) iterations reads \(P_{R} = 1-(1-P_{1})^R\). For example, if we want to find an optimal solution at least 95% of cases \((P_{R} = 0.95)\), we need a number of iterations \(R \geq log(1-P_{R})/log(1-P_{1})\).

  • Maximum execution time (y-axis) and number of iterations R (color coded) plotted as a function of the number of compatible barcode combinations used for the search, grouped by chemistry, algorithm, barcode-set size, and multiplex level

The results shown in the graphs below are heterogeneous between conditions. However, common trends emerge. First, in general, the greedy-exchange search performs the optimization faster over R iterations than the greedy-descent search, except in a few cases (see next point) where the number \(N_{comp}\) of compatible combinations is smaller than 30. Second, when \(N_{comp}\) is larger than 120 (only tested for the greedy-exchange algorithm), the optimization requires less iterations in general and performs faster. Based on these observations, it seems reasonable to fix the compatible-combination subset size to 120.

  • Number of iterations required to find the optimal selection with 95 % confidence, and corresponding ratio between greedy-descent and greedy-exchange execution times, as a function of the number of lanes \(a\), grouped by algorithm, multiplex level, and number of compatible barcodes used for the search

In the graphs below, data are now pooled regardless chemistry and barcode-set sizes. The number displayed in the right grey panel represents the size the randomly chosen subset of compatible combinations. We confirm that the greedy-descent search performs faster for small subsets of compatible combinations. For subsets larger than 30 compatible combinations, it seems that 50 iterations would be a reasonable number for an efficient and fast optimization by the greedy-exchange search.

3.3.4 Distributions of entropy values without (S_random) and with (S_opt) optimization

To visualize the efficacy of the greedy search algorithms, we compare the distribution of entropy values obtained from randomly generated selections of compatible barcodes with the distribution of entropy values obtained from optimized selections of compatible barcodes. The graphs below show that both greedy algorithms are comparable and dramatically reduce the barcode frequencies’ heterogeneity by favoring selections of compatible barcode combinations with a higher entropy. One graph table is shown per subset size. The blue line indicates the position of the theoretical value of the maximum entropy \(S_{max}\). A dashed line indicates that \(S_{max}\) has been reached, whereas a solid line indicates that \(S_{max}\) has not been reached.

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]